This report is a continuation of the first phase, which gave an overview of the data and its concepts, including a brief look at the distribution of counts at multiple levels; after that I met with microbiologists to become more familiar with the domain. In this phase I explore the data and look for informative features. There are several problems, such as normalization and dimension reduction, for which I will test multiple approaches in order to study the data and find the factors responsible for the phenotypes.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(highcharter)
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(stringr)
library(tidyr)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(Rtsne)
library(pheatmap)
library(limma)
library(coin)
## Loading required package: survival
library(destiny)
Rows are OTUs and columns are samples. There is also a taxonomy column that gives a fuller description of each OTU; it can be extracted as a vector, providing a mapping between ‘#OTU ID’ and taxonomy.
microbeCounts <- read_tsv("./data/16s_taxonomic_profiles.tsv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## `#OTU ID` = col_character(),
## taxonomy = col_character()
## )
## See spec(...) for full column specifications.
microbeTaxonomy <- microbeCounts$taxonomy
microbeCounts %>%
select(-c(taxonomy)) -> microbeCounts
Now loading the metadata:
metadata <- read_csv("./data/hmp2_metadata.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## week_num = col_integer(),
## interval_days = col_integer(),
## visit_num = col_integer(),
## total_reads = col_integer(),
## IntervalSequence = col_integer(),
## ProjectSpecificID = col_integer(),
## `Age at diagnosis` = col_integer(),
## `Number of flora tubes collected:` = col_integer(),
## `Rectum Flora:` = col_integer(),
## `Ileum flora:` = col_integer(),
## `Other inflamed flora:` = col_integer(),
## `Non-inflamed flora:` = col_integer(),
## `Number of tubes collected for epithelial cell biopsies:` = col_integer(),
## `Rectum cell biopsy:` = col_integer(),
## `Ileum cell biopsy:` = col_integer(),
## `Other non-inflamed cell biopsy:` = col_integer(),
## `Number of DNA/RNA tubes collected:` = col_integer(),
## consent_age = col_integer(),
## `If no, in what year did you come to the United States?` = col_integer(),
## `If yes, how many times?` = col_integer()
## # ... with 24 more columns
## )
## See spec(...) for full column specifications.
length(colnames(metadata))
## [1] 489
There are plenty of factors that can be studied; I am going to examine several of them and their correlation with IBD, the disease described in the previous reports.
unique(metadata$diagnosis)
## [1] "CD" "UC" "nonIBD"
This disease has two subtypes, “CD” and “UC”, and I treat them as a single patient group here!
metadata %>%
filter(`External ID` %in% colnames(microbeCounts)) -> metadata
groups <- metadata %>%
select(`External ID` , diagnosis) %>%
mutate(isPatient = ifelse(diagnosis == "CD" | diagnosis == "UC" , 1 , 0)) %>%
select(`External ID` , isPatient)
sum(groups == 1)
## [1] 133
sum(groups == 0)
## [1] 45
In order to see the correlation between samples, we can use a heatmap:
microbeCounts <- as.data.frame(microbeCounts)
rownames(microbeCounts) <- microbeCounts$`#OTU ID`
microbeCounts <- microbeCounts %>%
select(-c(`#OTU ID`))
pheatmap(microbeCounts)
It seems that we should look at statistics other than correlation!
As we have 982 features and 178 samples, we can project into lower dimensions to see whether the samples form clusters and how clean the clustering is.
In PCA we look for linear combinations of the features, which here means linear combinations of microbe counts. Let's see what happens:
pc <- prcomp(x = microbeCounts, center = F , scale. = F)
eigv = round(pc$sdev^2/sum(pc$sdev^2)*100, 2)
eigv = data.frame(c(1:178),eigv)
names(eigv) = c('PCs','Variance')
plot(eigv,type = "b",col = "darkblue",lwd = 2);grid()
We can visualize PCA in multiple ways to see its usefulness:
plot(summary(pc)$importance[3,], type="l",
ylab="%variance explained", xlab="nth component (decreasing order)")
abline(h=0.99,col="red");abline(v = 32,col="red",lty=3)
So it seems that the first few PCs contain less than \(80\%\) of the total variance; roughly 32 components are needed to reach \(99\%\). But we can still see how the clusters look in lower dimensions:
pcr <- data.frame(pc$rotation[,1:3] , group = groups$isPatient) # written out; pc$r relied on partial matching
ggplot(pcr) +
geom_point(aes(x = PC1 , y = PC2, color = group)) + theme_bw()
The separation is not satisfying in 2 dimensions. Now visualizing PCA in three dimensions:
plot_ly(data = pcr , x = ~PC1 , y = ~PC2 , z = ~PC3 , color = ~group)
## No trace type specified:
## Based on info supplied, a 'scatter3d' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter3d
## No scatter3d mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
It is not satisfactory either. Maybe non-linear methods would give better results. It is also worth trying centering and scaling, to somewhat normalize the dataset for PCA:
pc <- prcomp(x = microbeCounts, center = T , scale. = T)
plot(summary(pc)$importance[3,], type="l",
ylab="%variance explained", xlab="nth component (decreasing order)")
abline(h=0.99,col="red");abline(v = 32,col="red",lty=3)
pcr <- data.frame(pc$rotation[,1:3] , group = groups$isPatient) # written out; pc$r relied on partial matching
ggplot(pcr) +
geom_point(aes(x = PC1 , y = PC2, color = group)) + theme_bw()
No!
tSNE tries to keep the structure of the data, projecting it so that points close together in the high-dimensional space stay close in the low-dimensional embedding. (This tSNE is run on the microbes, to see whether they have any specific structure.)
rtsneObj <- Rtsne::Rtsne(X = microbeCounts , dims = 2 , perplexity = 20 , check_duplicates = F)
tsneCoords <- rtsneObj$Y
colnames(tsneCoords) <- c("X" , "Y")
ggplot(as.data.frame(tsneCoords)) +
geom_point(aes(x = X , y = Y)) +
ggtitle("tSNE on Microbes")
So let's try tSNE on the samples and see whether they can be classified or not:
rtsneObj <- Rtsne::Rtsne(X = t(as.matrix(microbeCounts)) , dims = 2 , perplexity = 20 , check_duplicates = F ,
pca_center = F , pca_scale = F)
tsneCoords <- as.data.frame(rtsneObj$Y)
colnames(tsneCoords) <- c("X" , "Y")
tsneCoords$group <- groups$isPatient
ggplot(as.data.frame(tsneCoords)) +
geom_point(aes(x = X , y = Y, color = group)) +
ggtitle("tSNE on Samples")
No, it does not! We can also try a lower perplexity:
rtsneObj <- Rtsne::Rtsne(X = t(as.matrix(microbeCounts)) , dims = 2 , perplexity = 5 , check_duplicates = F,
pca_center = F , pca_scale = F)
tsneCoords <- as.data.frame(rtsneObj$Y)
colnames(tsneCoords) <- c("X" , "Y")
tsneCoords$group <- groups$isPatient
ggplot(as.data.frame(tsneCoords)) +
geom_point(aes(x = X , y = Y, color = group)) +
ggtitle("tSNE on Samples")
Maybe other dimension reduction methods that are non-linear would help.
Alternatively, maybe the phenotype can be expressed as a simple linear transform of the microbes and their counts. But as we have many features and few samples, we should remove some features and keep only a subset:
tmp <- microbeCounts
microbesSum <- rowSums(tmp)
tmp$sumOfCounts <- microbesSum
tmp %>%
filter(sumOfCounts > 470) -> tmp
tmp %>%
select(-c(sumOfCounts)) -> tmp
microbeCountsWithPheno <- as.data.frame(t(as.data.frame(tmp)))
microbeCountsWithPheno$isPatient <- groups$isPatient
# To use regression we should keep fewer features than samples, hence the count filter above
fit <- lm(isPatient ~ . , data = microbeCountsWithPheno)
summary(fit)
##
## Call:
## lm(formula = isPatient ~ ., data = microbeCountsWithPheno)
##
## Residuals:
## 206646 224324 206619 224326 206624 219644
## -9.295e-06 -2.216e-04 4.353e-04 1.971e-05 -1.132e-02 2.112e-04
## 214995 215058 206750 206617 206626 219638
## 3.986e-04 -2.210e-05 -2.717e-04 -3.256e-03 -3.021e-02 -9.628e-04
## 224330 224328 206710 206676 206762 206635
## -2.972e-04 -1.836e-03 4.465e-04 2.721e-04 -1.969e-03 -4.612e-05
## 206702 206769 206534 215007 206678 206752
## 2.056e-03 -6.464e-03 3.174e-04 -1.079e-04 -1.248e-03 -7.062e-04
## 206675 206572 219652 219693 206738 215048
## -2.580e-04 2.672e-04 -2.662e-04 3.923e-04 -1.480e-04 -1.887e-17
## 215076 206720 206630 206758 219668 206669
## 6.793e-04 3.011e-04 -4.515e-04 -5.972e-04 5.739e-03 4.870e-02
## 206741 206659 215056 206684 206564 206732
## 1.163e-04 -1.430e-04 5.902e-04 -5.778e-04 9.055e-04 9.144e-06
## 206768 219654 206648 206561 206672 206747
## 2.502e-03 -1.626e-04 -2.470e-04 -1.567e-04 7.886e-04 1.498e-03
## 219637 206681 219633 206643 206671 206695
## 4.783e-01 -1.125e-03 1.748e-04 6.629e-04 -1.390e-03 2.481e-04
## 215010 215074 206766 206704 206682 206605
## 3.025e-05 8.410e-04 1.060e-03 7.440e-04 -9.797e-03 7.037e-04
## 206660 206620 215080 206756 219659 215009
## 3.292e-05 -3.830e-02 3.708e-04 -2.224e-04 1.365e-03 -4.034e-05
## 219676 206770 206656 206645 206730 206700
## 9.435e-05 7.611e-05 -6.047e-04 2.592e-03 1.862e-03 3.300e-06
## 215084 206713 206725 219651 206729 206734
## 1.913e-03 -3.231e-04 -3.467e-04 -4.969e-05 -1.419e-04 -2.296e-04
## 206727 219646 215004 206629 206614 206719
## 9.321e-05 -8.042e-04 -1.180e-02 -1.691e-04 -2.110e-03 2.220e-03
## 206547 206724 206622 206603 206563 206739
## -2.983e-05 -1.194e-05 6.335e-04 8.906e-04 -5.654e-18 -1.240e-04
## 206636 215023 206743 224845 219634 224327
## -7.339e-03 -6.884e-06 -2.875e-04 1.820e-04 -2.133e-04 2.684e-03
## 206623 219692 206658 224325 206742 206618
## 1.348e-02 -4.767e-02 9.031e-06 -2.300e-05 1.294e-04 -4.471e-04
## 206609 206708 206647 219645 215050 224323
## -8.849e-04 -9.518e-04 7.939e-05 5.733e-04 1.506e-05 -1.045e-05
## 206644 206755 206677 206711 206703 206767
## -1.462e-03 1.107e-03 -1.746e-03 2.431e-05 -1.241e-03 -3.273e-03
## 215077 219657 206548 219669 219643 206621
## -3.979e-04 3.924e-05 1.042e-04 1.638e-04 4.253e-04 -1.315e-01
## 206616 206746 206733 206751 219653 206761
## 2.510e-02 -1.043e-03 3.261e-04 4.203e-04 2.091e-04 2.655e-03
## 215062 219670 206723 206571 219655 222171
## -2.807e-04 -2.572e-04 2.017e-02 -2.135e-03 -5.771e-05 -2.362e-03
## 215057 206753 206721 215075 215008 206668
## 5.929e-06 7.010e-04 8.610e-05 -5.738e-04 -3.133e-03 1.212e-03
## 222170 206625 206740 206731 219658 206562
## 5.703e-03 -4.153e-03 -4.332e-04 -9.567e-06 7.130e-04 8.553e-05
## 206757 206569 206673 224844 206627 206667
## 5.838e-04 -9.690e-05 6.812e-04 3.747e-04 -3.023e-01 -1.058e-05
## 206536 214994 206538 215085 215067 215061
## -1.875e-04 -3.904e-04 -7.123e-04 -1.272e-03 -1.971e-06 1.605e-04
## 206726 206570 215055 215049 206615 206712
## 3.788e-04 8.591e-05 -1.732e-03 4.747e-03 2.435e-04 1.068e-03
## 219656 206683 206709 206608 219691 206701
## 1.350e-04 9.079e-04 3.987e-04 4.723e-04 2.376e-04 -4.911e-03
## 206604 215003 206628 219675 206728 206754
## -1.159e-03 1.905e-03 7.612e-03 4.531e-04 -2.634e-04 1.169e-04
## 206718 206657 206670 206655
## -3.262e-03 -4.341e-06 3.950e-05 -7.407e-04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.047e-01 3.481e-01 1.450 0.384
## V1 1.872e-02 1.985e-02 0.943 0.519
## V2 4.082e-02 7.968e-01 0.051 0.967
## V3 -5.224e+00 8.739e+00 -0.598 0.657
## V4 -7.357e-02 1.008e-01 -0.730 0.599
## V5 -2.501e-02 2.295e-02 -1.090 0.473
## V6 -1.499e-02 3.004e-02 -0.499 0.705
## V7 6.783e-01 8.077e-01 0.840 0.555
## V8 2.201e-01 4.818e-01 0.457 0.727
## V9 -6.887e-02 1.471e-01 -0.468 0.721
## V10 2.948e-01 6.852e-01 0.430 0.741
## V11 1.833e-03 1.582e-03 1.158 0.453
## V12 2.676e-02 1.666e-01 0.161 0.899
## V13 9.918e-02 1.765e-01 0.562 0.674
## V14 3.192e-02 4.487e-02 0.711 0.606
## V15 1.050e-01 1.713e-01 0.613 0.650
## V16 2.615e-02 9.702e-02 0.270 0.832
## V17 2.364e-01 3.575e-01 0.661 0.628
## V18 -3.628e-02 1.644e-01 -0.221 0.862
## V19 -7.510e-02 1.008e-01 -0.745 0.592
## V20 -1.122e-01 1.620e-01 -0.693 0.614
## V21 -1.551e-01 2.398e-01 -0.647 0.634
## V22 6.091e-03 1.739e-02 0.350 0.786
## V23 -3.286e-01 7.537e-01 -0.436 0.738
## V24 -6.455e-01 1.287e+00 -0.502 0.704
## V25 2.150e-02 4.400e-01 0.049 0.969
## V26 2.996e+00 4.691e+00 0.639 0.638
## V27 -5.782e-02 2.646e-01 -0.219 0.863
## V28 -3.728e-01 4.428e-01 -0.842 0.554
## V29 1.351e-01 2.530e-01 0.534 0.688
## V30 1.142e-01 7.585e-01 0.151 0.905
## V31 2.061e-02 3.952e-02 0.521 0.694
## V32 5.053e-02 5.604e-02 0.902 0.533
## V33 3.566e-01 4.017e-01 0.888 0.538
## V34 1.091e-01 2.406e-01 0.453 0.729
## V35 -2.997e-01 3.769e-01 -0.795 0.572
## V36 6.138e-02 1.358e-01 0.452 0.730
## V37 -9.670e-03 4.043e-02 -0.239 0.851
## V38 -1.534e-01 1.732e-01 -0.885 0.539
## V39 1.286e-01 2.366e-01 0.544 0.683
## V40 7.960e-01 1.457e+00 0.546 0.682
## V41 -3.210e-02 3.606e-02 -0.890 0.537
## V42 -1.344e-02 1.306e-02 -1.029 0.491
## V43 -4.239e-02 3.397e-01 -0.125 0.921
## V44 -4.766e-02 7.042e-02 -0.677 0.621
## V45 1.881e-01 5.549e-01 0.339 0.792
## V46 -1.170e-04 3.853e-02 -0.003 0.998
## V47 1.831e-02 2.580e-02 0.710 0.607
## V48 4.621e-02 9.368e-02 0.493 0.708
## V49 7.253e-03 1.044e-02 0.695 0.613
## V50 -2.468e+00 6.081e+00 -0.406 0.755
## V51 -3.890e-02 1.555e-01 -0.250 0.844
## V52 2.619e-01 3.212e-01 0.815 0.565
## V53 1.749e-02 1.068e+00 0.016 0.990
## V54 1.102e-01 1.479e-01 0.745 0.592
## V55 -7.667e-02 8.546e-02 -0.897 0.535
## V56 -1.912e+00 3.084e+00 -0.620 0.647
## V57 8.911e-02 1.194e-01 0.746 0.592
## V58 -3.560e-03 4.892e-03 -0.728 0.600
## V59 9.550e-01 3.599e+00 0.265 0.835
## V60 5.520e-01 1.182e+00 0.467 0.722
## V61 -1.892e-01 2.798e-01 -0.676 0.621
## V62 6.184e-01 9.854e-01 0.628 0.643
## V63 -3.668e-02 4.788e-02 -0.766 0.584
## V64 1.778e-01 3.500e-01 0.508 0.701
## V65 1.421e-01 3.848e-01 0.369 0.775
## V66 -6.777e-02 8.570e-02 -0.791 0.574
## V67 -1.957e-02 8.165e-01 -0.024 0.985
## V68 5.457e-03 2.980e-01 0.018 0.988
## V69 2.952e-03 7.162e-03 0.412 0.751
## V70 5.009e-04 2.166e-02 0.023 0.985
## V71 4.696e-03 2.820e-01 0.017 0.989
## V72 -3.725e-01 4.964e-01 -0.750 0.590
## V73 9.977e-01 1.240e+00 0.805 0.569
## V74 -9.111e-02 1.517e-01 -0.600 0.656
## V75 1.719e+00 2.685e+00 0.640 0.638
## V76 -3.252e-02 3.645e-01 -0.089 0.943
## V77 3.511e-03 7.013e-02 0.050 0.968
## V78 2.287e-03 1.639e-02 0.140 0.912
## V79 -6.067e-02 1.273e-01 -0.477 0.717
## V80 -4.511e-02 5.626e-02 -0.802 0.570
## V81 -8.700e-03 2.764e-02 -0.315 0.806
## V82 4.572e-02 5.401e-02 0.846 0.553
## V83 2.708e-01 2.879e-01 0.941 0.519
## V84 -2.621e-02 1.039e-01 -0.252 0.843
## V85 -3.246e-01 4.105e-01 -0.791 0.574
## V86 6.362e-03 5.708e-01 0.011 0.993
## V87 7.064e-02 3.050e-01 0.232 0.855
## V88 -2.317e-01 5.884e-01 -0.394 0.761
## V89 -5.079e-01 7.192e-01 -0.706 0.609
## V90 1.845e-01 4.836e-01 0.381 0.768
## V91 -8.945e-01 1.227e+00 -0.729 0.599
## V92 4.336e-01 7.239e-01 0.599 0.656
## V93 1.750e-01 2.971e-01 0.589 0.661
## V94 2.183e-02 3.176e-02 0.687 0.617
## V95 -1.067e-02 1.574e-02 -0.678 0.621
## V96 -3.256e-01 4.040e-01 -0.806 0.568
## V97 -1.167e+00 1.646e+00 -0.709 0.607
## V98 -2.792e-03 7.006e-03 -0.399 0.759
## V99 2.471e-01 4.610e-01 0.536 0.687
## V100 1.618e-02 3.220e-02 0.502 0.704
## V101 -8.011e-02 1.541e+00 -0.052 0.967
## V102 7.707e-01 8.468e-01 0.910 0.530
## V103 5.956e-02 2.762e-01 0.216 0.865
## V104 -1.392e+00 1.892e+00 -0.736 0.596
## V105 -3.489e-02 5.602e-02 -0.623 0.645
## V106 -3.612e-01 2.916e+00 -0.124 0.922
## V107 6.630e-01 3.015e+00 0.220 0.862
## V108 3.017e-01 5.299e-01 0.569 0.670
## V109 -1.056e+00 3.018e+00 -0.350 0.786
## V110 -2.554e-01 6.818e-01 -0.375 0.772
## V111 -2.827e-02 1.141e-01 -0.248 0.845
## V112 2.040e-02 6.423e-02 0.318 0.804
## V113 -1.579e-01 7.750e-01 -0.204 0.872
## V114 -4.813e-02 1.646e-01 -0.292 0.819
## V115 -5.676e-02 7.260e-02 -0.782 0.578
## V116 -2.798e-03 2.036e-02 -0.137 0.913
## V117 -1.489e-01 1.849e-01 -0.805 0.568
## V118 -8.592e-02 2.185e-01 -0.393 0.761
## V119 -1.341e-01 3.237e-01 -0.414 0.750
## V120 -2.066e-01 9.665e-01 -0.214 0.866
## V121 -6.144e-02 5.477e-02 -1.122 0.463
## V122 -3.145e-01 5.334e-01 -0.590 0.661
## V123 -4.011e-02 7.234e-02 -0.554 0.678
## V124 -2.170e-02 1.109e-01 -0.196 0.877
## V125 -5.747e-02 6.728e-02 -0.854 0.550
## V126 2.517e-01 2.935e-01 0.857 0.549
## V127 -3.679e-03 6.286e-01 -0.006 0.996
## V128 -1.209e-01 1.485e+00 -0.081 0.948
## V129 1.033e+00 1.081e+00 0.956 0.514
## V130 -1.963e-01 2.333e-01 -0.841 0.555
## V131 5.253e-02 2.944e-01 0.178 0.888
## V132 3.422e-01 5.837e-01 0.586 0.662
## V133 2.312e-02 1.390e-01 0.166 0.895
## V134 8.180e-02 1.079e-01 0.758 0.587
## V135 1.158e-03 1.361e-02 0.085 0.946
## V136 -2.231e-02 1.407e-01 -0.159 0.900
## V137 -5.329e-02 1.267e-01 -0.421 0.746
## V138 9.113e-02 3.293e-01 0.277 0.828
## V139 8.203e-02 2.677e-01 0.306 0.811
## V140 -2.901e-03 2.023e-01 -0.014 0.991
## V141 5.421e-02 1.371e-01 0.395 0.760
## V142 -4.619e-02 4.777e-01 -0.097 0.939
## V143 1.224e-02 9.505e-02 0.129 0.918
## V144 -1.097e-01 1.265e-01 -0.868 0.545
## V145 2.968e-01 4.063e-01 0.731 0.598
## V146 6.652e-01 8.713e-01 0.764 0.585
## V147 1.342e-01 1.402e-01 0.957 0.514
## V148 1.462e-02 1.079e-01 0.135 0.914
## V149 -2.548e-01 1.572e+00 -0.162 0.898
## V150 -2.229e-01 2.544e-01 -0.876 0.542
## V151 -7.650e-03 1.779e-02 -0.430 0.741
## V152 -4.233e-01 5.932e-01 -0.714 0.605
## V153 -3.781e-05 3.592e-04 -0.105 0.933
## V154 -3.968e-01 4.256e-01 -0.932 0.522
## V155 2.090e-01 4.801e-01 0.435 0.739
## V156 1.706e-02 2.743e-02 0.622 0.646
## V157 4.356e-02 8.333e-02 0.523 0.693
## V158 1.101e-01 4.528e-01 0.243 0.848
## V159 4.801e-01 7.219e-01 0.665 0.626
## V160 -4.757e-01 1.641e+00 -0.290 0.820
## V161 5.233e-01 8.757e-01 0.598 0.657
## V162 3.020e-03 5.066e-03 0.596 0.658
## V163 5.783e-03 1.185e-01 0.049 0.969
## V164 -1.931e-02 3.056e-02 -0.632 0.641
## V165 -6.805e-02 1.299e-01 -0.524 0.693
## V166 3.119e-01 7.351e-01 0.424 0.745
## V167 -6.166e-01 8.672e-01 -0.711 0.607
## V168 -2.348e-02 3.896e-01 -0.060 0.962
## V169 1.105e-01 2.096e-01 0.527 0.691
## V170 -9.663e-02 2.449e-01 -0.395 0.761
## V171 -5.896e-02 8.232e-02 -0.716 0.604
## V172 1.682e-01 3.031e-01 0.555 0.677
## V173 -3.735e+00 3.061e+00 -1.220 0.437
## V174 -5.518e-02 2.347e-01 -0.235 0.853
## V175 1.447e-01 9.161e-01 0.158 0.900
## V176 1.739e-02 7.549e-02 0.230 0.856
##
## Residual standard error: 0.5887 on 1 degrees of freedom
## Multiple R-squared: 0.9897, Adjusted R-squared: -0.8242
## F-statistic: 0.5456 on 176 and 1 DF, p-value: 0.8225
In this model I tried to fit the phenotype to the microbes with high counts, but it is clearly not a good model.
We can also try a logistic (binomial) model for this count data:
fit <- glm(isPatient ~ . , data = microbeCountsWithPheno , family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit)
##
## Call:
## glm(formula = isPatient ~ ., family = "binomial", data = microbeCountsWithPheno)
##
## Deviance Residuals:
## 206646 224324 206619 224326 206624 219644
## 2.408e-06 2.404e-06 2.417e-06 2.409e-06 2.186e-06 2.413e-06
## 214995 215058 206750 206617 206626 219638
## 2.416e-06 2.408e-06 2.403e-06 2.345e-06 1.802e-06 2.390e-06
## 224330 224328 206710 206676 206762 206635
## 2.403e-06 2.373e-06 2.417e-06 2.414e-06 2.370e-06 2.408e-06
## 206702 206769 206534 215007 206678 206752
## 2.448e-06 2.282e-06 2.415e-06 2.407e-06 2.384e-06 2.395e-06
## 206675 206572 219652 219693 206738 215048
## 2.404e-06 2.414e-06 2.403e-06 -2.401e-06 -2.412e-06 2.409e-06
## 215076 206720 206630 206758 219668 206669
## 2.422e-06 2.414e-06 2.400e-06 2.397e-06 2.519e-06 3.302e-06
## 206741 206659 215056 206684 206564 206732
## 2.411e-06 2.406e-06 2.420e-06 2.397e-06 2.426e-06 2.409e-06
## 206768 219654 206648 206561 206672 206747
## 2.457e-06 2.406e-06 2.404e-06 2.406e-06 2.424e-06 2.438e-06
## 219637 206681 219633 206643 206671 206695
## 7.832e-06 2.387e-06 2.412e-06 2.422e-06 2.382e-06 2.413e-06
## 215010 215074 206766 206704 206682 206605
## 2.409e-06 2.425e-06 2.429e-06 -2.394e-06 -2.597e-06 -2.395e-06
## 206660 206620 215080 206756 219659 215009
## -2.408e-06 -3.120e-06 -2.401e-06 2.404e-06 2.435e-06 -2.409e-06
## 219676 206770 206656 206645 206730 206700
## 2.411e-06 2.410e-06 2.397e-06 2.459e-06 -2.372e-06 -2.409e-06
## 215084 206713 206725 219651 206729 206734
## 2.446e-06 2.402e-06 -2.415e-06 -2.410e-06 2.406e-06 2.404e-06
## 206727 219646 215004 206629 206614 206719
## 2.410e-06 2.393e-06 2.176e-06 -2.412e-06 -2.450e-06 -2.365e-06
## 206547 206724 206622 206603 206563 206739
## -2.409e-06 -2.409e-06 -2.396e-06 2.426e-06 2.409e-06 2.406e-06
## 206636 215023 206743 224845 219634 224327
## 2.265e-06 2.408e-06 2.403e-06 2.412e-06 2.405e-06 2.460e-06
## 206623 219692 206658 224325 206742 206618
## 2.666e-06 1.433e-06 2.409e-06 2.408e-06 2.411e-06 2.400e-06
## 206609 206708 206647 219645 215050 224323
## 2.391e-06 -2.427e-06 -2.407e-06 2.420e-06 2.409e-06 2.408e-06
## 206644 206755 206677 206711 206703 206767
## 2.380e-06 2.430e-06 2.375e-06 2.409e-06 2.385e-06 2.345e-06
## 215077 219657 206548 219669 219643 206621
## 2.401e-06 2.409e-06 2.411e-06 2.412e-06 2.417e-06 2.110e-08
## 206616 206746 206733 206751 219653 206761
## 2.882e-06 2.388e-06 2.415e-06 2.417e-06 2.413e-06 2.460e-06
## 215062 219670 206723 206571 219655 222171
## 2.403e-06 2.404e-06 2.791e-06 2.367e-06 -2.410e-06 -2.454e-06
## 215057 206753 206721 215075 215008 206668
## -2.409e-06 -2.395e-06 -2.407e-06 -2.420e-06 -2.469e-06 -2.385e-06
## 222170 206625 206740 206731 219658 206562
## -2.297e-06 -2.489e-06 -2.417e-06 -2.409e-06 2.422e-06 2.410e-06
## 206757 206569 206673 224844 206627 206667
## 2.420e-06 2.407e-06 2.422e-06 -2.401e-06 -6.551e-06 -2.409e-06
## 206536 214994 206538 215085 215067 215061
## -2.412e-06 -2.416e-06 -2.422e-06 2.384e-06 2.409e-06 2.412e-06
## 206726 206570 215055 215049 206615 206712
## 2.416e-06 2.410e-06 2.375e-06 2.500e-06 2.413e-06 -2.388e-06
## 219656 206683 206709 206608 219691 206701
## -2.406e-06 2.426e-06 2.416e-06 2.418e-06 2.413e-06 -2.503e-06
## 206604 215003 206628 219675 206728 206754
## -2.431e-06 -2.372e-06 -2.260e-06 2.417e-06 2.403e-06 2.411e-06
## 206718 206657 206670 206655
## 2.345e-06 2.409e-06 2.409e-06 2.394e-06
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.844e+01 9.692e+04 0 1
## V1 2.041e+00 8.350e+03 0 1
## V2 -7.994e+01 4.744e+05 0 1
## V3 -1.263e+03 5.185e+06 0 1
## V4 -5.777e+00 4.603e+04 0 1
## V5 -1.852e+00 1.380e+04 0 1
## V6 -1.366e+00 2.367e+04 0 1
## V7 3.222e+01 5.158e+05 0 1
## V8 7.517e+01 3.015e+05 0 1
## V9 2.445e+00 7.433e+04 0 1
## V10 5.108e+01 3.468e+05 0 1
## V11 9.536e-02 9.843e+02 0 1
## V12 -1.634e+01 1.066e+05 0 1
## V13 -3.266e+00 9.514e+04 0 1
## V14 1.645e+00 1.961e+04 0 1
## V15 5.982e+00 7.572e+04 0 1
## V16 3.125e-01 5.285e+04 0 1
## V17 3.209e+01 1.567e+05 0 1
## V18 1.012e+01 8.913e+04 0 1
## V19 -2.788e+00 5.055e+04 0 1
## V20 -1.038e+01 6.792e+04 0 1
## V21 -2.049e+01 1.099e+05 0 1
## V22 -9.323e-01 9.902e+03 0 1
## V23 -1.214e+02 4.966e+05 0 1
## V24 -6.496e+01 5.961e+05 0 1
## V25 -5.246e+01 2.902e+05 0 1
## V26 5.787e+02 2.466e+06 0 1
## V27 2.506e+01 1.765e+05 0 1
## V28 -2.833e+01 1.812e+05 0 1
## V29 1.493e+01 1.386e+05 0 1
## V30 1.102e+02 5.454e+05 0 1
## V31 -1.608e+00 2.421e+04 0 1
## V32 5.565e+00 3.528e+04 0 1
## V33 3.817e+01 1.686e+05 0 1
## V34 2.335e+01 1.283e+05 0 1
## V35 -1.554e+01 1.667e+05 0 1
## V36 1.583e+01 7.057e+04 0 1
## V37 -6.356e+00 3.609e+04 0 1
## V38 -1.628e+01 1.116e+05 0 1
## V39 -5.504e+00 1.212e+05 0 1
## V40 -3.270e+01 8.895e+05 0 1
## V41 -7.148e-01 2.062e+04 0 1
## V42 -1.759e+00 8.735e+03 0 1
## V43 2.922e+01 1.994e+05 0 1
## V44 -7.580e+00 3.124e+04 0 1
## V45 1.608e+01 2.950e+05 0 1
## V46 2.744e+00 2.147e+04 0 1
## V47 1.977e+00 1.761e+04 0 1
## V48 1.493e+01 5.997e+04 0 1
## V49 -3.724e-01 1.028e+04 0 1
## V50 -6.939e+02 3.077e+06 0 1
## V51 1.622e+01 1.160e+05 0 1
## V52 2.799e+00 2.227e+05 0 1
## V53 -9.466e-01 6.097e+05 0 1
## V54 1.985e+01 7.242e+04 0 1
## V55 -1.132e+01 4.889e+04 0 1
## V56 -2.369e+02 1.461e+06 0 1
## V57 9.228e+00 5.354e+04 0 1
## V58 -3.456e-01 1.992e+03 0 1
## V59 4.100e+02 2.084e+06 0 1
## V60 1.475e+02 6.388e+05 0 1
## V61 -3.185e+01 1.217e+05 0 1
## V62 1.368e+02 5.454e+05 0 1
## V63 -3.313e+00 1.997e+04 0 1
## V64 1.765e+01 2.026e+05 0 1
## V65 -2.785e+01 2.543e+05 0 1
## V66 -3.653e+00 4.044e+04 0 1
## V67 -7.283e+01 4.431e+05 0 1
## V68 -3.089e+01 1.854e+05 0 1
## V69 7.747e-01 3.554e+03 0 1
## V70 1.618e+00 1.227e+04 0 1
## V71 -3.073e+01 1.777e+05 0 1
## V72 -4.888e+01 2.082e+05 0 1
## V73 1.904e+02 6.749e+05 0 1
## V74 -2.614e+01 1.042e+05 0 1
## V75 3.877e+02 1.574e+06 0 1
## V76 -3.582e+00 1.702e+05 0 1
## V77 -1.019e+00 3.644e+04 0 1
## V78 1.153e+00 1.101e+04 0 1
## V79 -2.041e+01 8.151e+04 0 1
## V80 -5.718e+00 3.616e+04 0 1
## V81 -1.607e+00 1.407e+04 0 1
## V82 6.227e+00 2.684e+04 0 1
## V83 6.004e+01 2.555e+05 0 1
## V84 -1.372e+01 6.776e+04 0 1
## V85 -5.393e+01 2.671e+05 0 1
## V86 4.045e+01 3.202e+05 0 1
## V87 5.403e+01 2.942e+05 0 1
## V88 -2.659e+00 3.069e+05 0 1
## V89 -5.010e+01 3.159e+05 0 1
## V90 7.414e+01 3.466e+05 0 1
## V91 -1.960e+02 7.172e+05 0 1
## V92 1.134e+02 4.407e+05 0 1
## V93 2.010e+01 1.241e+05 0 1
## V94 2.206e+00 1.413e+04 0 1
## V95 -2.196e+00 8.003e+03 0 1
## V96 -3.470e+01 1.607e+05 0 1
## V97 -3.239e+01 7.751e+05 0 1
## V98 3.668e-01 4.153e+03 0 1
## V99 2.710e+01 2.055e+05 0 1
## V100 4.685e+00 1.885e+04 0 1
## V101 -3.934e+01 9.826e+05 0 1
## V102 1.141e+02 4.009e+05 0 1
## V103 2.855e+01 1.448e+05 0 1
## V104 -2.291e+02 1.095e+06 0 1
## V105 4.453e-01 2.791e+04 0 1
## V106 1.954e+02 1.650e+06 0 1
## V107 3.687e+02 1.833e+06 0 1
## V108 7.338e+01 2.824e+05 0 1
## V109 -3.364e+02 1.597e+06 0 1
## V110 -8.037e+01 4.263e+05 0 1
## V111 7.636e+00 6.554e+04 0 1
## V112 2.879e+00 4.327e+04 0 1
## V113 5.434e+01 5.081e+05 0 1
## V114 7.629e+00 9.206e+04 0 1
## V115 -1.260e+01 4.445e+04 0 1
## V116 5.906e-01 1.093e+04 0 1
## V117 -2.052e+01 9.720e+04 0 1
## V118 -7.760e+00 1.266e+05 0 1
## V119 -2.805e+01 2.252e+05 0 1
## V120 -1.342e+02 8.123e+05 0 1
## V121 -7.992e+00 3.128e+04 0 1
## V122 -7.769e+01 3.084e+05 0 1
## V123 -1.240e+01 4.983e+04 0 1
## V124 -9.041e+00 6.136e+04 0 1
## V125 -7.375e+00 3.240e+04 0 1
## V126 3.864e+01 1.487e+05 0 1
## V127 -5.307e+01 3.449e+05 0 1
## V128 -1.897e+02 9.485e+05 0 1
## V129 1.203e+02 4.935e+05 0 1
## V130 -2.823e+01 1.177e+05 0 1
## V131 2.537e+01 1.877e+05 0 1
## V132 2.760e+01 2.862e+05 0 1
## V133 1.759e+01 8.267e+04 0 1
## V134 3.225e+00 5.659e+04 0 1
## V135 1.646e+00 8.424e+03 0 1
## V136 -1.949e+01 9.529e+04 0 1
## V137 -1.590e+01 7.325e+04 0 1
## V138 3.222e+01 1.963e+05 0 1
## V139 3.442e+00 1.649e+05 0 1
## V140 7.942e+00 1.163e+05 0 1
## V141 8.626e+00 7.182e+04 0 1
## V142 4.891e+01 2.945e+05 0 1
## V143 -8.472e-01 4.575e+04 0 1
## V144 -1.639e+01 6.956e+04 0 1
## V145 6.146e+01 2.232e+05 0 1
## V146 5.783e+01 3.801e+05 0 1
## V147 1.398e+01 7.253e+04 0 1
## V148 2.813e+00 4.583e+04 0 1
## V149 -1.879e+02 9.622e+05 0 1
## V150 -3.077e+01 1.203e+05 0 1
## V151 3.245e-01 1.343e+04 0 1
## V152 -4.788e+01 2.685e+05 0 1
## V153 -2.394e-03 2.032e+02 0 1
## V154 -7.926e+01 2.940e+05 0 1
## V155 5.754e+01 2.471e+05 0 1
## V156 1.008e+00 1.261e+04 0 1
## V157 6.938e-01 4.129e+04 0 1
## V158 9.834e+00 2.763e+05 0 1
## V159 7.167e+01 3.363e+05 0 1
## V160 -2.009e+02 9.652e+05 0 1
## V161 9.311e+01 4.566e+05 0 1
## V162 -7.110e-02 2.659e+03 0 1
## V163 1.204e+01 6.915e+04 0 1
## V164 -3.634e+00 1.670e+04 0 1
## V165 -6.824e-02 6.414e+04 0 1
## V166 -1.550e+01 3.821e+05 0 1
## V167 -1.358e+01 3.866e+05 0 1
## V168 -4.563e+01 2.780e+05 0 1
## V169 4.302e+01 2.157e+05 0 1
## V170 -3.812e+01 1.729e+05 0 1
## V171 -7.664e+00 3.411e+04 0 1
## V172 3.777e+01 1.669e+05 0 1
## V173 -5.254e+02 2.033e+06 0 1
## V174 1.675e+01 1.412e+05 0 1
## V175 8.691e+01 4.866e+05 0 1
## V176 -5.324e+00 4.237e+04 0 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.0128e+02 on 177 degrees of freedom
## Residual deviance: 1.1260e-09 on 1 degrees of freedom
## AIC: 354
##
## Number of Fisher Scoring iterations: 25
It is awful and does not work either. So we can set linear models aside, for two reasons: 1- to use linear models you need more samples than features; selecting the microbes with high counts was an attempt to work around this. 2- The fitted model is poor anyway: all p-values are high and the overall fit statistics are bad.
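The first reason can be seen directly on simulated data; this is a minimal sketch (toy objects `toyX` and `toyFit`, not the HMP2 counts) of what happens to `lm()` when predictors outnumber samples:

```r
# Simulated p >> n scenario: 10 samples, 50 predictors.
set.seed(1)
toyX <- as.data.frame(matrix(rnorm(10 * 50), nrow = 10))
toyX$y <- rbinom(10, 1, 0.5)
toyFit <- lm(y ~ ., data = toyX)
# Only as many coefficients as samples are estimable (10 of 51, intercept
# included); the rest come back NA, and the residuals are numerically zero,
# i.e. the model simply memorizes the training data.
sum(is.na(coef(toyFit)))
max(abs(residuals(toyFit)))
```

This is exactly the near-zero-residual, uninterpretable summary seen above.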
boxplot(microbeCounts)
A possible approach is to normalize the counts with respect to the samples, since total counts may simply be higher in some samples than in others. I propose a normalization technique based on the L2 norm:
NormalizeL2Sqr <- function(iData){
require(dplyr)
sqrData <- as.data.frame(sqrt(iData))
iData %>%
colSums(na.rm = T) %>%
sqrt() -> sampleSum
if(length(which(sampleSum == 0)) != 0){
sqrData <- sqrData[,-which(sampleSum == 0)] # removing samples (columns) whose total count is zero
sampleSum <- sampleSum[-which(sampleSum == 0)]
warning("Some samples have zero total counts and were removed")
}
# open question: should microbes that are never observed be removed here? ***
if(ncol(sqrData) == 0){
stop("All microbe counts are zero in every sample")
}
# divide each sample column by the square root of its total count, so that
# every column has unit L2 norm after the square-root transform
# (sweep is a vectorized replacement for a slow per-column apply)
sweep(sqrData , 2 , sampleSum , "/")
}
Each element \(x_i\) of a sample is mapped to \[ \frac{\sqrt{x_i}}{\sqrt{\sum_j x_j}} \] so that every sample column has unit L2 norm after the square-root transform. Now using this method on the data:
microbeCountsNormalized <- NormalizeL2Sqr(microbeCounts)
pheatmap(microbeCountsNormalized)
The heatmap looks better than before!
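The claimed property of this normalization can also be sanity-checked on a tiny toy matrix (made-up values, not the real counts): after the square-root transform and the division, every sample column should have unit L2 norm.

```r
# Toy check: after sqrt-transform and division by the square root of
# each sample's total count, every column has L2 norm 1.
toy     <- data.frame(s1 = c(4, 9, 0), s2 = c(1, 1, 2))
toyNorm <- sweep(sqrt(toy), 2, sqrt(colSums(toy)), "/")
colSums(toyNorm^2)   # both columns sum to exactly 1
```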
Now running PCA on the samples again, to see whether they can be classified easily:
pca <- prcomp(microbeCountsNormalized , center = F , scale. = F)
pcr <- data.frame(pca$rotation[,1:3] , group = groups$isPatient) # written out; pca$r relied on partial matching
ggplot(pcr) +
geom_point(aes(x = PC1 , y = PC2, color = group)) + theme_bw()
No! Let’s use tSNE:
rtsneObj <- Rtsne::Rtsne(X = t(as.matrix(microbeCountsNormalized)) , dims = 2 , perplexity = 5 , check_duplicates = F,
pca_center = F , pca_scale = F)
tsneCoords <- as.data.frame(rtsneObj$Y)
colnames(tsneCoords) <- c("X" , "Y")
tsneCoords$group <- groups$isPatient
ggplot(as.data.frame(tsneCoords)) +
geom_point(aes(x = X , y = Y, color = group)) +
ggtitle("tSNE on Samples")
The result is still unsatisfactory: a simple k-means clustering here would mislabel many samples.
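That claim can be made concrete with a sketch on simulated, heavily overlapping 2-D points (standing in for the embedding coordinates, not the real data): cross-tabulating the k-means labels against the true labels counts the mislabelings.

```r
# k-means on simulated overlapping 2-D groups, compared against the
# true labels with a confusion table.
set.seed(5)
coords <- rbind(matrix(rnorm(100, mean = 0),   ncol = 2),
                matrix(rnorm(100, mean = 0.5), ncol = 2))
truth  <- rep(c(0, 1), each = 50)
km <- kmeans(coords, centers = 2, nstart = 20)
table(truth, cluster = km$cluster)   # off-diagonal cells are mislabeled samples
```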
We’ve tested multiple methods, but there is still a good standard approach, hypothesis testing on each microbe, to see the differences between the groups. We can implement our usual t-test:
tTest <- function(x1 , x2){
x1.mean <- mean(x1)
x2.mean <- mean(x2)
x1.sd <- sd(x1)
x2.sd <- sd(x2)
df <- length(x1) + length(x2) - 2
# assuming the variances are roughly equal, the pooled variance is:
sps = ((length(x1) - 1) * x1.sd^2 + (length(x2) - 1) * x2.sd^2) / df
tstat = abs(x1.mean - x2.mean) / (sqrt(sps) * sqrt(1/length(x1) + 1/length(x2)))
pt(tstat , df , lower.tail = F) * 2
# two-sided, matching t.test in R
}
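As a sanity check, this pooled-variance p-value should agree with R's built-in equal-variance t-test. A small self-contained check on simulated vectors (the function is restated here so the snippet runs on its own):

```r
# Sanity check: the hand-rolled pooled-variance p-value matches R's
# built-in two-sided equal-variance t-test.
tTest <- function(x1 , x2){   # same logic as the function above
  df <- length(x1) + length(x2) - 2
  sps <- ((length(x1) - 1) * sd(x1)^2 + (length(x2) - 1) * sd(x2)^2) / df
  tstat <- abs(mean(x1) - mean(x2)) / (sqrt(sps) * sqrt(1/length(x1) + 1/length(x2)))
  pt(tstat , df , lower.tail = F) * 2
}
set.seed(2)
a <- rnorm(20)
b <- rnorm(25, mean = 1)
all.equal(tTest(a , b) , t.test(a , b , var.equal = TRUE)$p.value)   # TRUE
```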
groups %>%
filter(isPatient == 1) -> tmp
microbeCountsPatients <- microbeCounts[,tmp$`External ID`]
patientsColumn <- dim(microbeCountsPatients)[2]
#dim(microbeCounts)
groups %>%
filter(isPatient == 0) -> tmp
microbeCountsHealthy <- microbeCounts[,tmp$`External ID`]
orderedDataSet <- cbind(microbeCountsPatients , microbeCountsHealthy)
tTestDEA <- apply(orderedDataSet , 1 , function(i) {
x1 <- i[1:patientsColumn]
x2 <- i[(patientsColumn + 1):ncol(orderedDataSet)]
tTest(x1 =x1 , x2 = x2)
})
DEATtest <- data.frame(MicrobeName = names(tTestDEA) , P_value = tTestDEA , stringsAsFactors = F)
DEATtest <- subset(DEATtest , P_value < 0.05)
dim(DEATtest)
## [1] 112 2
We can see that 112 microbes pass a simple per-microbe test at the 0.05 level. But an important problem in multiple hypothesis testing is correcting the p-values. I first try the Bonferroni correction to get a sketch of the p-values (strictly speaking the threshold should be divided by the total number of tests, not only by the 112 that passed the naive cut):
DEATtestB <- subset(DEATtest , P_value < (0.05 / dim(DEATtest)[1]))
dim(DEATtestB)
## [1] 5 2
DEATtestB$MicrobeName
## [1] "Unc03y4v" "Unc36622" "Unc81447" "Unc38399" "Unc00y95"
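A side note (my addition): the same adjustment is available through `p.adjust`, and strictly speaking the Bonferroni divisor should be the total number of tests performed, not only the 112 tests already below 0.05 as above. A toy sketch with made-up p-values:

```r
# Bonferroni via p.adjust (toy p-values): multiplying each p-value by the
# number of tests is equivalent to dividing the significance threshold.
p <- c(0.0001, 0.003, 0.04, 0.2, 0.6)
p.bonf <- p.adjust(p, method = "bonferroni")
p.bonf   # 0.0005 0.0150 0.2000 1.0000 1.0000, same as pmin(p * length(p), 1)
```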
Only five microbes remained. We can also try FDR correction:
tTestDEA <- apply(orderedDataSet , 1 , function(i) {
x1 <- i[1:patientsColumn]
x2 <- i[(patientsColumn + 1):ncol(orderedDataSet)]
tTest(x1 =x1 , x2 = x2)
})
DEATtest <- data.frame(MicrobeName = names(tTestDEA) , P_value = tTestDEA , stringsAsFactors = F)
adjustedPValues <- p.adjust(DEATtest$P_value , method = "fdr")
DEATtestFDR <- DEATtest
DEATtestFDR$P_value <- adjustedPValues
subset(DEATtestFDR , P_value < 0.05)
## MicrobeName P_value
## Unc03y4v Unc03y4v 0.0072747433
## Unc36622 Unc36622 0.0011250931
## Unc81447 Unc81447 0.0299726689
## Unc38399 Unc38399 0.0271134489
## Unc00y95 Unc00y95 0.0009854847
selectedMicrobes.ttest <- DEATtestFDR %>%
filter(P_value < 0.05)
selectedMicrobes.ttest <- selectedMicrobes.ttest$MicrobeName
Only five microbes remain, and these are the same microbes selected by the Bonferroni method:
intersect(DEATtestB$MicrobeName , selectedMicrobes.ttest)
## [1] "Unc03y4v" "Unc36622" "Unc81447" "Unc38399" "Unc00y95"
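For reference, the Benjamini-Hochberg ("fdr") adjustment that `p.adjust` applies can be written out by hand; a minimal sketch on toy p-values (my addition):

```r
# BH / FDR step-up adjustment by hand: sort the p-values, scale by n/rank,
# then enforce monotonicity with a cumulative minimum from the right.
p <- c(0.02, 0.001, 0.3, 0.04, 0.008)
n <- length(p)
o <- order(p)                                    # ascending order of p
bh.sorted <- pmin(1, rev(cummin(rev(p[o] * n / seq_len(n)))))
bh.adj <- bh.sorted[order(o)]                    # back to original order
bh.adj                                           # matches p.adjust(p, method = "fdr")
```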
We can also use a permutation test instead of the t-test, since it does not rely on the normality assumption:
orderedDataSet$microbeCounts <- rowSums(orderedDataSet)
tmpRowNames <- rownames(orderedDataSet)
zeroMicrobes <- rownames(orderedDataSet[which(orderedDataSet$microbeCounts == 0),])
orderedDataSet %>%
filter(microbeCounts != 0) -> orderedDataSet
dim(orderedDataSet)
## [1] 959 179
orderedDataSet %>%
select(-c(microbeCounts)) -> orderedDataSet
tmpRowNames <- setdiff(tmpRowNames , zeroMicrobes)
rownames(orderedDataSet) <- tmpRowNames
permTest <- apply(orderedDataSet , 1 , function(i) {
x1 <- i[1:patientsColumn]
x2 <- i[(patientsColumn + 1):ncol(orderedDataSet)]
x1df <- data.frame(class = "patient" , counts = x1)
x2df <- data.frame(class = "healthy" , counts = x2)
tmp <- rbind(x1df , x2df)
coin::oneway_test(counts ~ class , data = tmp , distribution = approximate(B = 1000))
})
DEAPermTest <- data.frame()
for(i in 1:length(permTest)){
DEAPermTest <- rbind(DEAPermTest , data.frame(MicrobeName = rownames(orderedDataSet)[i], P_value = coin::pvalue(permTest[[i]])[1]))
}
head(DEAPermTest)
## MicrobeName P_value
## 1 IP8BSoli 0.825
## 2 UncTepi3 0.678
## 3 Unc004ii 0.823
## 4 Unc00re8 0.582
## 5 Unc018j2 0.344
## 6 Unc04u81 0.944
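The mechanics behind `coin::oneway_test` with an `approximate()` distribution can be sketched by hand (my simplified version, using the absolute mean difference as the test statistic):

```r
# Hand-rolled permutation test (a sketch, not the coin implementation):
# shuffle the group labels B times and count how often the permuted mean
# difference is at least as extreme as the observed one.
permPvalue <- function(x1, x2, B = 1000) {
  obs <- abs(mean(x1) - mean(x2))
  pooled <- c(x1, x2)
  n1 <- length(x1)
  extreme <- replicate(B, {
    idx <- sample(length(pooled), n1)            # random relabeling
    abs(mean(pooled[idx]) - mean(pooled[-idx])) >= obs
  })
  mean(extreme)
}
set.seed(42)
p.perm <- permPvalue(rnorm(30), rnorm(30, mean = 2))
p.perm   # should be near 0 for such a clear group difference
```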
adjustedPValues <- p.adjust(DEAPermTest$P_value , method = "fdr")
DEAPermTestFDR <- DEAPermTest
DEAPermTestFDR$P_value <- adjustedPValues
subset(DEAPermTestFDR , P_value < 0.05)
## MicrobeName P_value
## 118 Unc03exo 0
## 200 Unc02ruj 0
## 463 Unc03uuu 0
## 524 Unc01pk4 0
## 565 Unc05n7t 0
## 615 Unc03tc5 0
## 683 Unc03y4v 0
## 732 Unc36622 0
## 790 Unc38399 0
## 804 Unc01zba 0
## 901 Unc00y95 0
selectedMicrobes <- DEAPermTestFDR %>%
filter(P_value < 0.05)
selectedMicrobes.perm <- selectedMicrobes$MicrobeName
Good! The permutation test selected 11 microbes that distinguish the two groups well. Let's see how the t-test results and the permutation test results overlap:
intersect(selectedMicrobes.ttest , selectedMicrobes.perm)
## [1] "Unc03y4v" "Unc36622" "Unc38399" "Unc00y95"
So the permutation test recovered four of the five t-test microbes (only Unc81447 is missing), plus several new ones. We can now check whether k-means after PCA can separate the samples using the union of the two sets.
selectedMicrobes <- union(selectedMicrobes.perm , selectedMicrobes.ttest)
orderedDataSet.selected <- orderedDataSet[selectedMicrobes , ]
pca <- prcomp(orderedDataSet.selected , center = F , scale. = F)
pcr <- data.frame(pca$rotation[,1:3] , group = groups$isPatient)
ggplot(pcr) +
geom_point(aes(x = PC1 , y = PC2, color = group)) + theme_bw()
Still not satisfying. Next, let's look at logistic regression performance for the selected microbes versus the phenotype.
t.orderedDataSet.selected <- as.data.frame(t(orderedDataSet.selected))
t.orderedDataSet.selected$isPatient <- groups$isPatient
fit <- glm(isPatient ~ . , data = t.orderedDataSet.selected , family = "binomial")
summary(fit)
##
## Call:
## glm(formula = isPatient ~ ., family = "binomial", data = t.orderedDataSet.selected)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.22049 0.01304 0.57088 0.64472 1.42915
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.7454445 0.2703480 6.456 1.07e-10 ***
## Unc03exo 0.0142904 0.0412711 0.346 0.72915
## Unc02ruj -0.0019878 0.0020264 -0.981 0.32662
## Unc03uuu -0.5589619 0.6035458 -0.926 0.35438
## Unc01pk4 -0.0016021 0.0005301 -3.022 0.00251 **
## Unc05n7t -0.0113908 0.0367798 -0.310 0.75679
## Unc03tc5 0.0185709 0.0206657 0.899 0.36885
## Unc03y4v 0.0042450 0.0058668 0.724 0.46933
## Unc36622 0.0006170 0.0026200 0.235 0.81384
## Unc38399 -0.0031163 0.0128288 -0.243 0.80807
## Unc01zba -0.0810254 0.1714104 -0.473 0.63643
## Unc00y95 -0.0012842 0.0006792 -1.891 0.05866 .
## Unc81447 0.6295148 0.4768830 1.320 0.18681
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 201.28 on 177 degrees of freedom
## Residual deviance: 169.62 on 165 degrees of freedom
## AIC: 195.62
##
## Number of Fisher Scoring iterations: 6
Hmm, essentially only the intercept (and Unc01pk4) reach small p-values, so this model does not seem very good at distinguishing the two groups. Maybe tSNE can visualize the separation better:
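As an aside, coefficient p-values alone do not tell us classification performance; here is a toy sketch on simulated data (my construction, not the HMP2 data) of judging a logistic fit by its accuracy:

```r
# Toy logistic regression on simulated data: evaluate by classification
# accuracy instead of coefficient significance alone.
set.seed(7)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 1.5 * x))   # outcome driven by x
fit.toy <- glm(y ~ x, family = "binomial")
acc <- mean(as.integer(predict(fit.toy, type = "response") > 0.5) == y)
acc                                        # in-sample accuracy
```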
rtsneObj <- Rtsne::Rtsne(X = t.orderedDataSet.selected , dim = 2 , perplexity = 5 , check_duplicates = F,
pca_center = F , pca_scale = F)
tsneCoords <- as.data.frame(rtsneObj$Y)
colnames(tsneCoords) <- c("X" , "Y")
tsneCoords$group <- groups$isPatient
ggplot(as.data.frame(tsneCoords)) +
geom_point(aes(x = X , y = Y, color = group)) +
ggtitle("tSNE on Microbes")
It is not amazing, but it is better than before, and I think we selected stronger signals: the full data set contained a lot of noise that hurt the clustering. Now let's normalize the data and try tSNE again:
rtsneObj <- Rtsne::Rtsne(X = t.orderedDataSet.selected , dim = 2 , perplexity = 5 , check_duplicates = F,
pca_center = T , pca_scale = T)
tsneCoords <- as.data.frame(rtsneObj$Y)
colnames(tsneCoords) <- c("X" , "Y")
tsneCoords$group <- groups$isPatient
result <- ggplot(as.data.frame(tsneCoords)) +
geom_point(aes(x = X , y = Y, color = group)) +
ggtitle("tSNE on Microbes")
result
Wow, amazing! I think we can separate the samples now. Although the normal samples are not tightly grouped, the patient samples lie away from them, so simple k-means should mislabel fewer points:
kCluster <- kmeans(t.orderedDataSet.selected , centers = 2 , nstart = 20 , iter.max = 10000)
kCluster$cluster
## 206615 206614 206617 206619 206616 206618 206621 206622
## 1 1 1 1 1 1 1 1
## 206620 206624 206623 206626 206625 206628 206627 206630
## 1 1 1 1 1 1 1 1
## 206629 206636 206635 206644 206643 206645 206646 206648
## 1 1 1 1 1 1 1 1
## 206647 206656 206655 206659 206660 206667 206668 206670
## 1 1 1 1 1 1 1 1
## 206669 206671 206672 206672.1 206673 206676 206675 206677
## 1 1 1 1 1 1 1 1
## 206678 206681 206682 206684 206683 206695 206700 206757
## 1 1 1 1 1 1 1 1
## 206758 206702 206701 206708 206709 206746 206747 206713
## 1 1 1 1 1 1 1 1
## 206712 206720 206721 206723 206724 206728 206727 206732
## 1 1 1 1 1 1 1 1
## 206731 206754 206734 206733 206750 206751 206756 206755
## 1 1 1 1 1 1 1 1
## 206762 206761 206767 206766 206769 206768 206770 224330
## 1 1 1 1 1 1 1 1
## 224327 224328 224325 224326 206536 206538 206534 215050
## 1 1 1 1 1 1 1 1
## 215023 206548 206547 206562 206561 206564 206563 206570
## 1 1 1 1 1 1 1 1
## 206569 215067 206571 206603 206572 206604 206605 206608
## 1 1 1 1 1 1 1 1
## 206609 215085 215084 214995 214994 215061 215062 215075
## 1 1 1 1 1 1 1 1
## 215076 215080 219633 219634 219637 219638 219643 219644
## 1 1 1 1 1 1 1 1
## 219646 219645 219659 219653 219654 219693 219668 219670
## 1 1 1 1 1 1 1 1
## 219669 219675 219676 219691 219692 206657 206658 206704
## 1 1 1 1 1 1 1 1
## 206703 206753 206752 206711 206710 206719 206726 206725
## 1 1 1 1 1 2 1 1
## 206730 206729 206739 206738 206741 206740 206742 206743
## 1 1 1 1 2 2 1 1
## 224324 224323 215004 215003 215008 215007 215010 215009
## 1 1 1 1 1 1 1 1
## 215049 215048 215056 215055 215058 215057 215074 215077
## 1 1 1 1 1 1 1 1
## 222170 222171 224844 224845 219652 219651 219656 219655
## 1 1 1 1 1 1 1 1
## 219657 219658
## 1 1
sum(kCluster$cluster == 2)
## [1] 3
sum(kCluster$cluster == 1)
## [1] 175
It is not satisfying: k-means puts 175 samples in one cluster and only 3 in the other, while the true classes contain 133 and 45 samples. So we may need another clustering algorithm.
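To quantify such mislabeling, a tiny helper (my sketch, assuming exactly two clusters coded with the same pair of labels as the truth) can score agreement while allowing for the arbitrary cluster numbering:

```r
# Agreement between a 2-cluster assignment and true labels: since k-means
# numbers its clusters arbitrarily, take the better of the direct match
# and the swapped match.
clusterAccuracy <- function(cluster, truth) {
  agree <- mean(cluster == truth)
  max(agree, 1 - agree)   # cluster ids may be flipped
}
clusterAccuracy(c(1, 1, 2, 2, 2), c(2, 2, 1, 1, 1))   # -> 1 (perfect up to relabeling)
```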
We can also use a clustering technique that works well on graphs. The idea is: 1) reduce dimensionality with diffusion maps; 2) separate the points using DC1. Because the diffusion components are eigenvectors, samples on opposite sides of DC1 \(= 0\) can be assigned to different groups. This could be very useful for clustering, but we should look at the results before deciding between the methods.
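The sign-of-DC1 idea is essentially a normalized-cut-style split; a minimal sketch on toy data (my construction in base R, not the destiny package):

```r
# Toy spectral split: build a Gaussian affinity, form the random-walk
# transition matrix, and split points by the sign of the second
# eigenvector (the analogue of splitting at DC1 = 0).
set.seed(3)
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))   # two clear blobs
A <- exp(-as.matrix(dist(pts))^2 / 2)                 # Gaussian affinity
P <- A / rowSums(A)                                   # random-walk transition matrix
ev <- eigen(P)
split <- Re(ev$vectors[, 2]) > 0                      # sign of 2nd eigenvector
table(split, rep(c("blob1", "blob2"), each = 20))
```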
dm2 <- DiffusionMap(data = t(as.matrix(orderedDataSet)) , n_eigs = 2, density_norm = F , distance = "euclidean", k = 20)
## Warning in dataset_extract_doublematrix(data, vars): Duplicate rows removed
## from data. Consider explicitly using `df[!duplicated(df), ]`
# Having duplicates in expression and removing them
tmp <- orderedDataSet[!duplicated(t(as.matrix(orderedDataSet))),]
tmpSampName <- setdiff(colnames(tmp) , groups$`External ID`)
groupTmp <- groups %>% filter(`External ID` %in% colnames(tmp))
groupTmp <- groupTmp[!duplicated(groupTmp),]
tmp <- tmp %>%
select(-c("206672.1"))
dcf <- data.frame(DC1 = eigenvectors(dm2)[,1] , DC2 = eigenvectors(dm2)[,2] ,
name = colnames(tmp) , gr = groupTmp$isPatient)
p <- ggplot(dcf) +
geom_point(aes(x = DC1 , y = DC2 , color = gr))
p
So it is not that satisfying, but let's subset the data and reduce the features to the selected microbes:
dm2 <- DiffusionMap(data = t.orderedDataSet.selected , n_eigs = 2, density_norm = F , distance = "euclidean", k = 20)
## Warning in dataset_extract_doublematrix(data, vars): Duplicate rows removed
## from data. Consider explicitly using `df[!duplicated(df), ]`
# Having duplicates in expression and removing them
tmp <- t.orderedDataSet.selected[!duplicated(t.orderedDataSet.selected),]
tmpSampName <- setdiff(rownames(tmp) , groups$`External ID`)
groupTmp <- groups %>% filter(`External ID` %in% rownames(tmp))
ids <- unique(groups$`External ID`)
groupTmp <- groupTmp[!duplicated(groupTmp),]
dcf <- data.frame(DC1 = eigenvectors(dm2)[,1] , DC2 = eigenvectors(dm2)[,2] ,
name = rownames(tmp) , gr = groupTmp$isPatient)
p <- ggplot(dcf) +
geom_point(aes(x = DC1 , y = DC2 , color = gr))
p
So it seems that it is not a good method for this work.
So I think the list of responsible signals is:
selectedMicrobes
## [1] "Unc03exo" "Unc02ruj" "Unc03uuu" "Unc01pk4" "Unc05n7t" "Unc03tc5"
## [7] "Unc03y4v" "Unc36622" "Unc38399" "Unc01zba" "Unc00y95" "Unc81447"
These microbes should be validated in the laboratory, but from a data analysis perspective I think this was a good approach.
I tried many approaches and changed the workflow multiple times to find the responsible features (microbes). I tried several normalization techniques, such as the simple one used in PCA and a method I defined myself (an L2 norm combined with a square root). After that I tried multiple models, such as linear models and logistic regression. I also used dimension reduction techniques like PCA and tSNE to visualize the data and see whether simple k-means could work afterwards. I also tried a method named "Normalized-cut" and implemented it in a simple way using diffusion maps. These models were not very useful. Finally, I found my way with multiple hypothesis testing: I tried the t-test and a permutation test, adjusted the p-values with the FDR method, and merged the two microbe sets obtained by selecting adjusted p-values below 0.05. Then I redid the visualization with these microbes. The first try, without normalization, was unsuccessful, but the next try, in which I centered and scaled the samples before tSNE, was satisfying. The resulting plot is as follows:
result
In this plot, I think we can distinguish between normal and patient samples: normal samples are shown in black and patient samples in blue. Although the patient samples are not tightly grouped, the two classes are separable, so I think the selected microbes are good features for distinguishing normal from patient samples.